How Specialized are Specialized Corpora? Behavioral Evaluation of Corpus Representativeness for Maltese
Abstract
In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack wide access to diverse language data.

1. Developing corpus resources for low-density languages

The benefits of developing language corpora that provide ready access to quantifiable language use in naturalistic settings have been widely embraced by scholars in a diverse set of language disciplines. In addition to the strong tradition of corpus analysis and development in applied research (dictionaries (Sinclair, 1987), language teaching (Biber and Conrad, 2001), translation (McEnery and Xiao, 2007) and machine-learning applications), corpus data are increasingly employed in theoretical investigation, in particular psycholinguistic studies on language processing (see Gilquin and Gries, 2009, for a survey). However, the great majority of the estimated 5,000-7,000 languages of the world are 'low-density', i.e. languages for which robust language resources are limited or non-existent (Borin, 2009).
This fact highlights an obvious lack of empirical coverage of the range of possible linguistic diversity, an obstacle both for theoretical and applied work on particular languages and for theoretical investigation more generally. To address this gap, many researchers have focused their efforts on developing resources for low-density languages (LDLs) (McEnery et al., 2006; Scannell, 2007, inter alia). Despite the best efforts of language researchers, there are unique challenges related to the quality and quantity of available data that researchers must face when developing corpora for LDLs, which ultimately may call into question the general applicability of the final product. Whereas access to primary data may be limited both in print and electronic form, creating sometimes insurmountable problems¹, language data that is available is often also restricted in terms of its overall representativeness of the target language (i.e. genres/registers, modalities, etc.) (Biber, 1993). Compared to languages such as English, where resources are literally samples of the language, techniques for attaining representativeness for other language corpora are not as straightforward, given that the resources that are available represent almost complete coverage of all language data in existence (Scannell, 2007): in effect, a representativeness bottleneck. Under standard evaluation practices many existing projects are considered 'specialized', that is, less-than-representative language samples.

¹ Difficulties in attaining data do not always stem from the number of speakers of a language, but may in fact reflect the interaction of various extra-linguistic factors (cultural, economic, ...

We gratefully recognize our colleagues Dr. Albert Gatt (U of Malta) and Jeff Berry (U of Arizona) for their invaluable assistance and acknowledge funding from the United States National Science Foundation (BCS-0715500) to Adam Ussishkin.
Accordingly, without some assurance of corpus validity, credible results from low-density language research are limited. However, these smaller and less-diverse language samples do not necessarily misrepresent distributional properties for those linguistic units that have been collected. It is logically possible that some, or all, of the linguistic units contained in the corpus are indeed representative of the larger language body from which it was sampled. Yet the question is: how do you know (i.e. how do you determine representativeness)? In what follows we describe a novel approach to evaluating corpus representativeness that exploits the relationship between corpus linguistics and psycholinguistics.

2. Behavioral data as external validation for corpus resources

Corpus-based evidence is inherently limited in gauging the relative representativeness of a corpus in an absolute sense. A general characteristic of corpus design and evaluation, this limitation is typically addressed by collecting large amounts of data rigorously sampled from a wide variety of sources. For LDL resources, which often lack accessible sources of data, this is a pressing issue. In this case an external, non-corpus-based metric is needed. We propose that evidence from psycholinguistic investigation based on data
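The comparison named in the abstract, corpus word frequency against rated word familiarity, can be illustrated with a small sketch. The excerpt does not name the three statistical methods the authors use, so the Spearman rank correlation below is an assumption chosen for illustration, not the paper's method. The log transform of the counts, the item labels, and all numbers are likewise hypothetical toy values, not Maltese data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: raw corpus counts and mean familiarity ratings
# (an assumed 1-7 scale) for five placeholder items.
corpus_counts = {"item_a": 1200, "item_b": 300, "item_c": 45, "item_d": 8, "item_e": 2}
familiarity = {"item_a": 6.8, "item_b": 6.1, "item_c": 5.9, "item_d": 3.2, "item_e": 3.5}

words = sorted(corpus_counts)
# Log-transform counts (add 1 so that zero counts would stay defined).
log_freq = [math.log10(corpus_counts[w] + 1) for w in words]
ratings = [familiarity[w] for w in words]

rho = spearman(log_freq, ratings)
print(f"Spearman rho between log frequency and familiarity: {rho:.2f}")
```

Under this reading, a strong positive correlation would suggest that the corpus frequencies track speakers' experience with the words, while a weak one would flag items or registers that the corpus under-represents. A rank-based measure is used here only because it makes no assumption about the shape of the frequency-familiarity relationship.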
Related resources
Evaluation of terminologies acquired from comparable corpora: an application perspective
This paper describes a protocol for the evaluation of bilingual terminologies acquired from comparable corpora. The aim of the protocol is to assess the terminologies' added value in a task of specialized translation. The protocol consists in having specialized texts translated in various situations: without any specialized resource, with a domain-related bilingual terminology or using Internet...
Bilingual Word Embeddings for Bilingual Terminology Extraction from Specialized Comparable Corpora
Bilingual lexicon extraction from comparable corpora is constrained by the small amount of available data when dealing with specialized domains. This aspect penalizes the performance of distributional-based approaches, which is closely related to the reliability of words' co-occurrence counts extracted from comparable corpora. A solution to avoid this limitation is to associate external resources...
Evaluation of an automatic process for specialized web corpora collection and term extraction for Basque
In this paper we describe the processes for collecting Basque specialized corpora in different domains from the Internet and subsequently extracting terminology out of them, using automatic tools in both cases. We evaluate the results of corpus compiling and term extraction by making use of a specialized dictionary recently updated by experts. We also compare the results of the automatically co...
Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction
The main work in bilingual lexicon extraction from comparable corpora is based on the implicit hypothesis that corpora are balanced. However, the historical context-based projection method dedicated to this task is relatively insensitive to the sizes of each part of the comparable corpus. Within this context, we have carried out a study on the influence of unbalanced specialized comparable corpo...
A corpus-based approach to the multimodal analysis of specialized knowledge
Modern communication environments have changed the cognitive patterns of individuals, who are now used to the interaction of information encoded in different semiotic modalities, especially visual and linguistic. Despite this, the main premise of Corpus Linguistics is still ruling: our perception of and experience with the world is conveyed in texts, which nowadays need to be studied from a mul...